Last modified: 15 November 2023
Give names and emails of group members here:
PLEASE DELETE THIS WHOLE CELL BEFORE SUBMITTING YOUR PROJECT
The purpose of this assignment is to develop your skills in organising and presenting a Data Science project.
Since most of the marks will be awarded for organisation and presentation, it is suggested that you do not initially attempt anything too complicated. However, once you have managed to get a basic pipeline working that fits the guidelines, you are encouraged to extend and elaborate your analysis.
Your project should be entirely contained within this template file. You should keep the basic structure indicated below, to facilitate grading according to the marking scheme.
You may import any module that is provided with Anaconda3 Python.
The marking scheme is as follows:
Project Plan:
Program Code: (15)
Code should be laid out in steps, with explanations and intermediate output with comments. You should ensure that the steps do not require a large amount of processing time.
Project Outcome:
You can use any data you like. Many useful resources are available.
The Data Resources section of the module (Unit 4.3 on Minerva) has links to several example data sets.
As a starting point you could browse the following:
Please use this notebook as a template for your project file. In the following cells of the notebook, italic text giving explanations and examples should be either deleted or, in most cases, replaced by appropriate text describing your project. Text that is not in italic (which is mostly headings) should be left as it is. Your project report notebook should have the same overall structure as this template notebook. An exception to this is the current markdown cell describing the project requirements. You should delete this before submitting your notebook.
YouTube provides an API that exposes information about trending videos by country. The dataset on Kaggle is a real-time (daily-updating) dataset derived from this API, consisting of attributes for various countries, and it is used for the analysis in this project. The scope of this project limits our use to a specific date range and restricts the country to Great Britain.
The dataset consists of two files, GB_category_id.json and GB_youtube_trending_data.csv. The CSV file contains the following columns:
['video_id', 'title', 'publishedAt', 'channelId', 'channelTitle', 'categoryId', 'trending_date', 'tags', 'view_count', 'likes', 'dislikes', 'comment_count', 'thumbnail_link', 'comments_disabled', 'ratings_disabled', 'description']
The JSON file contains a data structure that links to the categoryId column in the CSV file. The JSON file also has information about each category, e.g. ['Family', 'Entertainment', 'Education']. The structure of the file is as follows:
{
  "kind": "youtube#videoCategoryListResponse",
  "etag": "kBCr3I9kLHHU79W4Ip5196LDptI",
  "items": [
    {
      "kind": "youtube#videoCategory",
      "etag": "IfWa37JGcqZs-jZeAyFGkbeh6bc",
      "id": "1",
      "snippet": {
        "title": "{{category_string}}",
        "assignable": {{boolean}},
        "channelId": "{{string}}"
      }
    },
    {
      "kind": "youtube#videoCategory",
      "etag": "5XGylIs7zkjHh5940dsT5862m1Y",
      "id": "2",
      "snippet": {
        "title": "{{category_string}}",
        "assignable": {{boolean}},
        "channelId": "{{string}}"
      }
    }
  ]
}
The primary objective of this project is to perform an in-depth analysis of the "Trending YouTube API Dataset" to identify and understand the important attributes that result in an uploaded video "trending" on the platform. The project aims to find patterns in the dataset that push videos uploaded by users into the trending category, thereby helping content creators optimise their uploads for the YouTube algorithm.
This aim is achieved by using data analysis techniques and machine learning algorithms to find correlations between the different attributes in the dataset. We make use of columns such as categoryId, title, view_count, likes and dislikes to recognise the patterns behind a trending video. After thorough analysis of the various aspects, we aim to discern the commonalities among trending videos, while also identifying the factors that vary across the different genres on YouTube.
In summary, the project aims to provide insight into the dynamics of YouTube Trending videos, thereby helping content creators share content that satisfies the YouTube algorithm and lands their videos in YouTube Trending. The results should help improve the experience of both viewers and YouTubers on the platform.
You should choose and list up to 4 specific objectives suited to the data you will be working with and the type of project you wish to carry out. There should be at least one per person doing the project. There is no need for the objectives to be completely different. They could be different stages of the processing requirements, different processing functions that the system provides, or just different aspects of the data analysis that will be conducted. Typically, it is expected that there would be one objective per person. Replace the following examples with your own objectives:
Describe your code in terms of the following two sections.
Typically this would be a pipeline in which data goes through several stages of transformation and analysis, but other architectures are possible. This does not need to be particularly complicated. A simple diagram with 100-150 words of explanation would be a good way to present your architecture.
Briefly list and describe the most significant computational components of your system and the algorithms you will use to implement them. This could include things like:
Your list can be presented in similar form to the one just given, but should include a brief but more specific description of the components and/or algorithms. Probably three or four components is sufficient for most projects, but you may want to have more.
We import pandas, since it is used for DataFrame operations, along with the built-in json module to parse our GB_category_id.json. We also import matplotlib, numpy and seaborn for the visualisations that follow.
import pandas as pd
import json
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
In the following code block, we read the CSV into a pandas data frame called df. We then read the JSON file into a variable called categories, select the category titles from categories, and map them into a new column called category in the df data frame.
We use the head() method to see how our data frame looks.
# read_csv already returns a DataFrame, so no separate conversion is needed
df = pd.read_csv('GB_youtube_trending_data.csv')
# Load the categories
with open('GB_category_id.json') as f:
categories = json.load(f)
# Create a dictionary to map category IDs to category names
category_dict = {int(item['id']): item['snippet']['title'] for item in categories['items']}
# Map the category IDs in the dataframe to the category names
df['category'] = df['categoryId'].map(category_dict)
df.head()
| video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description | category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | J78aPJ3VyNs | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | UCYzPXprvl5Y-Sf0g4vX-m6g | jacksepticeye | 24 | 2020-08-12T00:00:00Z | jacksepticeye|funny|funny meme|memes|jacksepti... | 2038853 | 353790 | 2628 | 40228 | https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg | False | False | I left youtube for a month and this is what ha... | Entertainment |
| 1 | 9nidKH8cM38 | TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS' | 2020-08-11T20:00:45Z | UCFMbX7frWZfuWdjAML0babA | Eleanor Neale | 27 | 2020-08-12T00:00:00Z | eleanor|neale|eleanor neale|eleanor neale true... | 236830 | 16423 | 209 | 1642 | https://i.ytimg.com/vi/9nidKH8cM38/default.jpg | False | False | The first 1000 people to click the link will g... | Education |
| 2 | M9Pmf9AB4Mo | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | UC0ZV6M2THA81QT9hrVWJG3A | Apex Legends | 20 | 2020-08-12T00:00:00Z | Apex Legends|Apex Legends characters|new Apex ... | 2381688 | 146739 | 2794 | 16549 | https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg | False | False | While running her own modding shop, Ramya Pare... | Gaming |
| 3 | kgUV1MaD_M8 | Nines - Clout (Official Video) | 2020-08-10T18:30:28Z | UCvDkzrj8ZPlBqRd6fIxdhTw | Nines | 24 | 2020-08-12T00:00:00Z | Nines|Trapper of the year|Crop Circle|Nines Tr... | 613785 | 37567 | 669 | 2101 | https://i.ytimg.com/vi/kgUV1MaD_M8/default.jpg | False | False | Nines - Clout (Official Video)Listen to Clout ... | Entertainment |
| 4 | 49Z6Mv4_WCA | i don't know what im doing anymore | 2020-08-11T20:24:34Z | UCtinbF-Q-fVthA0qrFQTgXQ | CaseyNeistat | 22 | 2020-08-12T00:00:00Z | [None] | 940036 | 87113 | 1860 | 7052 | https://i.ytimg.com/vi/49Z6Mv4_WCA/default.jpg | False | False | ssend love to my sponsor; for a super Limited ... | People & Blogs |
Since the data frame looks fine, we are ready for the data exploration part.
df.shape
(238191, 17)
We see that there are 17 features (columns) and 238191 trending videos (rows). To ensure correct analysis, we check for null items in each column using isna(), which returns boolean values that we then sum column-wise.
df.isna().sum()
video_id                0
title                   0
publishedAt             0
channelId               0
channelTitle            0
categoryId              0
trending_date           0
tags                    0
view_count              0
likes                   0
dislikes                0
comment_count           0
thumbnail_link          0
comments_disabled       0
ratings_disabled        0
description          4279
category              102
dtype: int64
The above output shows that there are 4279 missing descriptions and 102 missing categories. Since missing values in these columns will not cause problems in our analysis, we do not drop those rows.
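As a quick double-check, we can list which categoryId values have no entry in the JSON mapping; a minimal sketch using the df and category_dict objects defined above:
# (Sketch) Which category IDs had no match in the JSON mapping?
# These account for the 102 missing 'category' entries noted above.
missing_ids = df.loc[df['category'].isna(), 'categoryId'].unique()
print(missing_ids)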
We also check for duplicate rows in the data frame using pandas' duplicated() function, as duplicate values could lead to bias. We then sum the result to see how many rows are duplicates. From the result, we see that there are 124 duplicate rows.
df.duplicated().sum()
124
We now use the drop_duplicates() function to drop those rows. We can then check the data frame's shape to cross-verify the deletion of duplicates.
df = df.drop_duplicates()
df.shape
(238067, 17)
Since the data is clean, we will start exploring the data and working on our objectives.
Firstly, we explore which categories of video trend most often, to learn what kinds of videos viewers are most drawn to. We plot a bar plot showing the number of trending videos in each category.
df['category'].value_counts().plot.bar(title='Figure 1 - No. of Videos per Category',xlabel='Category',ylabel='Number of Trending Videos',color='royalblue')
<Axes: title={'center': 'Figure 1 - No. of Videos per Category'}, xlabel='Category', ylabel='Number of Trending Videos'>
As we can see from Figure 1, Entertainment tops the list, followed by Sports, Gaming, Music, and People & Blogs, making up the Top 5 categories across the dataset for the period between Aug 2020 and Nov 2023.
Now that we know the ranking of the categories that trend, we further investigate the nature of the trending categories month-wise. We can do this by converting the trending_date column in the data frame into the date-time datatype and storing it in a new column called date. Further columns are created based on year and month to aid in plotting. We can then label the month from integers to their names in English to enhance clarity. A graph is then plotted based on category and month.
# plot bar graph based on category and date by month
df['date'] = pd.to_datetime(df['trending_date'])
df['month'] = df['date'].dt.month
df['year'] = df['date'].dt.year
df['month'] = df['month'].replace((1,2,3,4,5,6,7,8,9,10,11,12),('Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'))
df['month'] = pd.Categorical(df['month'], categories=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec'], ordered=True)
#plot bar graph based on category and date by month
df.groupby(['month','category']).size().unstack().plot(kind='bar',stacked=True,figsize=(18,12),title="Figure 2 - No. of Videos per Category by Month",xlabel="Month",ylabel='No. of Trending Videos').legend(bbox_to_anchor=(1.0, 1.0),title="Category")
<matplotlib.legend.Legend at 0x154bae150>
We use a stacked bar plot to compare the number of trending videos each month. As shown in Figure 2, October has the highest number of trending videos, followed by September and August, while February has the least. The top 5 categories from Figure 1 have a broadly similar spread across the plot.
The spike in the number of trending videos between August and November may be an artefact of the period covered by our dataset, which runs from August 2020 to November 2023: the data for 2020 and 2023 is incomplete. We will probe how big the difference is between years by comparing only the top 5 categories.
The data frame is further filtered to the Top 5 categories per Figure 1.
# plot grouped bar graph for top 5 category per year
df.groupby(['year','category']).size().groupby(level=0).nlargest(5).unstack().plot(kind='bar',stacked=False,figsize=(15,10),xlabel="Year",ylabel='No. of Trending Videos',title="Figure 3 - Top 5 Categories per Year").legend(bbox_to_anchor=(1.0, 1.0),title="Category")
plt.xticks(np.array([0,1,2,3]), ('2020','2021','2022','2023'),rotation=0)
([<matplotlib.axis.XTick at 0x154f83710>, <matplotlib.axis.XTick at 0x154f8a190>, <matplotlib.axis.XTick at 0x154febd50>, <matplotlib.axis.XTick at 0x15717fb90>], [Text(0, 0, '2020'), Text(1, 0, '2021'), Text(2, 0, '2022'), Text(3, 0, '2023')])
Figure 3 shows that in 2020 there were fewer trending videos than in any other year, and the count for 2023 is also lower than for 2021 and 2022. Since we have incomplete data for 2020 and 2023, those bars are shorter, which explains the apparent spike between August and November in Figure 2. Within the top five categories, we can also see a downtrend in Entertainment from 2021 to 2023, while Gaming rises and Music, People & Blogs and Sports stay almost constant.
To aid our understanding of how keywords in the title affect a video's ability to trend, we use an external Python module called wordcloud. This allows us to visualise frequently mentioned words in the titles.
from wordcloud import WordCloud
df.head()
| video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | dislikes | comment_count | thumbnail_link | comments_disabled | ratings_disabled | description | category | date | month | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | J78aPJ3VyNs | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | UCYzPXprvl5Y-Sf0g4vX-m6g | jacksepticeye | 24 | 2020-08-12T00:00:00Z | jacksepticeye|funny|funny meme|memes|jacksepti... | 2038853 | 353790 | 2628 | 40228 | https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg | False | False | I left youtube for a month and this is what ha... | Entertainment | 2020-08-12 00:00:00+00:00 | Aug | 2020 |
| 1 | 9nidKH8cM38 | TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS' | 2020-08-11T20:00:45Z | UCFMbX7frWZfuWdjAML0babA | Eleanor Neale | 27 | 2020-08-12T00:00:00Z | eleanor|neale|eleanor neale|eleanor neale true... | 236830 | 16423 | 209 | 1642 | https://i.ytimg.com/vi/9nidKH8cM38/default.jpg | False | False | The first 1000 people to click the link will g... | Education | 2020-08-12 00:00:00+00:00 | Aug | 2020 |
| 2 | M9Pmf9AB4Mo | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | UC0ZV6M2THA81QT9hrVWJG3A | Apex Legends | 20 | 2020-08-12T00:00:00Z | Apex Legends|Apex Legends characters|new Apex ... | 2381688 | 146739 | 2794 | 16549 | https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg | False | False | While running her own modding shop, Ramya Pare... | Gaming | 2020-08-12 00:00:00+00:00 | Aug | 2020 |
| 3 | kgUV1MaD_M8 | Nines - Clout (Official Video) | 2020-08-10T18:30:28Z | UCvDkzrj8ZPlBqRd6fIxdhTw | Nines | 24 | 2020-08-12T00:00:00Z | Nines|Trapper of the year|Crop Circle|Nines Tr... | 613785 | 37567 | 669 | 2101 | https://i.ytimg.com/vi/kgUV1MaD_M8/default.jpg | False | False | Nines - Clout (Official Video)Listen to Clout ... | Entertainment | 2020-08-12 00:00:00+00:00 | Aug | 2020 |
| 4 | 49Z6Mv4_WCA | i don't know what im doing anymore | 2020-08-11T20:24:34Z | UCtinbF-Q-fVthA0qrFQTgXQ | CaseyNeistat | 22 | 2020-08-12T00:00:00Z | [None] | 940036 | 87113 | 1860 | 7052 | https://i.ytimg.com/vi/49Z6Mv4_WCA/default.jpg | False | False | ssend love to my sponsor; for a super Limited ... | People & Blogs | 2020-08-12 00:00:00+00:00 | Aug | 2020 |
We create df_objective2 as a copy of df (the main data set) and store the length of each title in a new column named title_length, using the .split() function to split the titles on spaces and the .len() function to count the resulting words. We then display the mean of title_length to find the average title length.
#find the average length of title in words
df_objective2 = df.copy()
df_objective2['title_length'] = df_objective2['title'].str.split().str.len()
df_objective2['title_length'].mean()
9.045142753930616
We plot a histogram of title_length with 20 bins, which divides the range into 20 equal intervals and counts the data points falling in each. The plot is labelled with a title and axis names (x = "Title Length", y = "Frequency").
df_objective2['title_length'].plot.hist(bins=20,figsize=(10,5),title="Figure 4 - Distribution of Title Length",xlabel="Title Length",ylabel="Frequency",color='royalblue')
<Axes: title={'center': 'Figure 4 - Distribution of Title Length'}, xlabel='Title Length', ylabel='Frequency'>
Here, the plot shows that the frequency is highest around a title length of 10 words, and the distribution looks roughly normal with some outliers.
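As a quick, informal check of that normality assumption, we can look at the skewness of title_length (values near 0 suggest symmetry); this is a sketch, not a formal test:
# (Sketch) Informal check of the distribution's shape:
# skewness near 0 suggests symmetry; positive values indicate a right tail.
print(df_objective2['title_length'].skew())
print(df_objective2['title_length'].describe())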
Here we create a word cloud showing the trending video titles by category. We first create fig and axes variables with subplots of 7 rows and 2 columns, then create a categories variable to hold the category names.
# Multiple graphs in the same cell
fig, axes = plt.subplots(7, 2, figsize=(15, 30))
# Get unique categories (dropping the null category to avoid join errors)
categories = df_objective2['category'].dropna().unique()
# Limit the number of categories to the number of subplots
categories = categories[:len(axes.flatten())]
# Plot the wordcloud for each category
for i, ax in enumerate(axes.flatten()):
    category = categories[i]
    title = df_objective2[df_objective2['category'] == category]['title']
    text = ' '.join(title)
    if text.strip():  # guard against categories with no titles
        wordcloud = WordCloud(max_font_size=50, max_words=10000, background_color="white").generate(text)
        ax.imshow(wordcloud, interpolation="bilinear")
    ax.set_title(category)
    ax.axis("off")
fig.suptitle('Figure 5 - Wordcloud of Trending Video Titles by Category',fontsize=20)
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
/var/folders/y6/rbs796bx78jcl293lzj57pnm0000gn/T/ipykernel_55561/3958442688.py:21: UserWarning: The figure layout has changed to tight fig.tight_layout(rect=[0, 0.03, 1, 0.95])
# Repeat the procedure for the tags column to find the average tag-string length in words.
# Note: tags are pipe-separated, so splitting on spaces counts words rather than whole tags.
df_objective2_tags = df.copy()
df_objective2_tags['tag_length'] = df_objective2_tags['tags'].str.split().str.len()
df_objective2_tags['tag_length'].mean()
17.73617511036809
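For symmetry with Figure 4, one could also plot the distribution of tag_length; a minimal sketch reusing the same histogram settings (this extra figure is not part of the original numbering):
# (Sketch) Distribution of tag-string length in words, mirroring Figure 4.
df_objective2_tags['tag_length'].plot.hist(bins=20, figsize=(10, 5),
        title="Distribution of Tag Length", xlabel="Tag Length",
        ylabel="Frequency", color='royalblue')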
# Multiple graphs in the same cell
fig, axes = plt.subplots(7, 2, figsize=(15, 30))
# Get unique categories (dropping the null category to avoid join errors)
categories = df_objective2['category'].dropna().unique()
# Limit the number of categories to the number of subplots
categories = categories[:len(axes.flatten())]
# Plot the wordcloud for each category
for i, ax in enumerate(axes.flatten()):
    category = categories[i]
    tags = df_objective2[df_objective2['category'] == category]['tags']
    text = ' '.join(tags)
    if text.strip():  # guard against categories with no tags
        wordcloud = WordCloud(max_font_size=50, max_words=10000, background_color="white").generate(text)
        ax.imshow(wordcloud, interpolation="bilinear")
    ax.set_title(category)
    ax.axis("off")
fig.suptitle('Figure 6 - Wordcloud of Trending Video tags by Category',fontsize=20)
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
/var/folders/y6/rbs796bx78jcl293lzj57pnm0000gn/T/ipykernel_55561/3948388135.py:21: UserWarning: The figure layout has changed to tight fig.tight_layout(rect=[0, 0.03, 1, 0.95])
def wordcloud_gen_v2(category):
    text1 = ' '.join(df_objective2[df_objective2['category'] == category]['title'])
    text2 = ' '.join(df_objective2[df_objective2['category'] == category]['tags'])
    from collections import Counter
    # Generate word frequency dictionaries
    word_freq1 = Counter(text1.split())
    word_freq2 = Counter(text2.split())
    # Find common words (Counter intersection keeps the minimum count of each word)
    common_words = word_freq1 & word_freq2
    # Generate a word cloud using only the common words
    common_text = ' '.join(common_words.elements())
    if common_text.strip():  # implicitly returns None when there are no common words
        wordcloud = WordCloud(max_font_size=50, max_words=10000, background_color="white").generate(common_text)
        return wordcloud
# Multiple graphs in the same cell
fig, axes = plt.subplots(7, 2, figsize=(15, 30))
# Get unique categories
categories = df_objective2['category'].unique()
# Limit the number of categories to the number of subplots
categories = categories[:len(axes.flatten())]
# Plot the wordcloud for each category
for i, ax in enumerate(axes.flatten()):
    category = categories[i]
    wordcloud = wordcloud_gen_v2(category)
    if wordcloud is not None:  # skip categories with no common words
        ax.imshow(wordcloud, interpolation="bilinear")
    ax.set_title(category)
    ax.axis("off")
fig.suptitle('Figure 7 - Wordcloud of Common Words in Trending Video Tags and Title by Category',fontsize=20)
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
/var/folders/y6/rbs796bx78jcl293lzj57pnm0000gn/T/ipykernel_55561/3111490073.py:21: UserWarning: The figure layout has changed to tight fig.tight_layout(rect=[0, 0.03, 1, 0.95])
We need to see all the columns but use only the ones required for objective 3. Thus, we use the columns attribute, which returns the column names of the data frame.
#find column names of dataframe
df.columns
Index(['video_id', 'title', 'publishedAt', 'channelId', 'channelTitle',
'categoryId', 'trending_date', 'tags', 'view_count', 'likes',
'dislikes', 'comment_count', 'thumbnail_link', 'comments_disabled',
'ratings_disabled', 'description', 'category', 'date', 'month', 'year'],
dtype='object')
Since we now have 20 columns, we drop the unnecessary ones using the drop() method before exploring the data further.
#drop columns that are not needed for objective 3
df_objective3 = df.drop(['video_id','channelId','tags','channelTitle','comment_count','thumbnail_link','comments_disabled','ratings_disabled','date','month','year','description'], axis=1)
df_objective3.head()
| title | publishedAt | categoryId | trending_date | view_count | likes | dislikes | category | |
|---|---|---|---|---|---|---|---|---|
| 0 | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | 24 | 2020-08-12T00:00:00Z | 2038853 | 353790 | 2628 | Entertainment |
| 1 | TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS' | 2020-08-11T20:00:45Z | 27 | 2020-08-12T00:00:00Z | 236830 | 16423 | 209 | Education |
| 2 | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | 20 | 2020-08-12T00:00:00Z | 2381688 | 146739 | 2794 | Gaming |
| 3 | Nines - Clout (Official Video) | 2020-08-10T18:30:28Z | 24 | 2020-08-12T00:00:00Z | 613785 | 37567 | 669 | Entertainment |
| 4 | i don't know what im doing anymore | 2020-08-11T20:24:34Z | 22 | 2020-08-12T00:00:00Z | 940036 | 87113 | 1860 | People & Blogs |
We now check the relationship between likes, view count, and comment count using scatterplots.
columns = ['likes', 'view_count', 'comment_count']
plt.figure(figsize=(16, 18))
plt.suptitle('Figure 8 - Scatterplot of Likes, View Count and Comment Count',fontsize=20)
for i in range(len(columns)):
    for j in range(i+1, len(columns)):
        plt.subplot(len(columns), len(columns), i*len(columns) + j + 1)
        # log scale; zero counts produce -inf and trigger the divide-by-zero warnings below
        sns.scatterplot(x=np.log(df[columns[i]]), y=np.log(df[columns[j]])).set(title=f'Scatterplot of {columns[i].title()} vs {columns[j].title()}', xlabel=columns[i].title(), ylabel=columns[j].title())
/Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs)
In Figure 8, we see that likes, view count, and comment count have an approximately linear relationship on the log-log scale: as likes increase, the view count increases, and the comment count increases with it.
We then look deeper into the likes and view counts for each category and look for similarities. We first drop the null rows.
Relation between view count and category
df_objective3 = df_objective3.dropna()
plt.scatter(df_objective3['view_count'],df_objective3['category'].astype(str))
plt.xlabel('View Count [in 100 million]')
plt.ylabel('Category')
plt.title('Figure 9 - View count vs. Category')
plt.show()
Figure 9 shows the view count (in hundreds of millions) for each category; Music has the highest view counts, followed by Entertainment and Education.
Relation between likes and category
plt.scatter(df_objective3['likes'],df_objective3['category'].astype(str))
plt.xlabel('Likes')
plt.ylabel('Category')
plt.title('Figure 10 - Likes vs. Category')
plt.show()
Figure 10 shows the likes for each category; Music again has the most likes, followed by Entertainment and Education. Since the two graphs show similar results, looking at view_count alone tells a similar story. We choose view count because its values are larger than the likes.
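A quick comparison of the two scales supports this choice; a minimal sketch:
# (Sketch) Compare the numerical scales of view_count and likes.
print(df_objective3[['view_count', 'likes']].describe())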
We then count the number of videos whose view count is higher than the mean view count.
df_objective3[df_objective3['view_count'] > df_objective3['view_count'].mean()].count()
title            51028
publishedAt      51028
categoryId       51028
trending_date    51028
view_count       51028
likes            51028
dislikes         51028
category         51028
dtype: int64
We also check the number of videos that have a view count of 0, using the count() method.
df_objective3['title'][df_objective3['view_count'] == 0].count()
94
Here, we see 94 videos that made it into the trending dataset with a view count of 0, which can be considered anomalies. From the graphs, videos with higher view counts are more likely to be recommended to users in their category than those with lower view counts.
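If we wanted to inspect or exclude these zero-view anomalies before aggregating, a minimal sketch would be:
# (Sketch) Inspect the 94 zero-view anomalies; optionally drop them.
zero_views = df_objective3[df_objective3['view_count'] == 0]
print(zero_views[['title', 'category', 'trending_date']].head())
# df_objective3 = df_objective3[df_objective3['view_count'] > 0]  # uncomment to exclude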
We also count the number of videos with more likes than the mean, for further investigation.
df_objective3['title'][df_objective3['likes'] > df_objective3['likes'].mean()].count()
49245
We find the maximum number of likes in the dataset using the max() method.
df_objective3['likes'].max()
15246514
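To see which video holds this maximum, we can look up the corresponding row with idxmax(); a hedged sketch:
# (Sketch) Identify the most-liked trending video in the dataset.
most_liked = df_objective3.loc[df_objective3['likes'].idxmax()]
print(most_liked[['title', 'category', 'likes', 'view_count']])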
We also compute the mean of the likes and view count for each category.
df_objective3.groupby('category')['likes'].mean()
category
Autos & Vehicles         41388.621114
Comedy                  115153.003197
Education                96781.728007
Entertainment           130497.046563
Film & Animation        116169.649217
Gaming                   88443.437217
Howto & Style            65767.418930
Music                   284905.892479
News & Politics          19181.273624
People & Blogs           91500.477865
Pets & Animals           50546.548057
Science & Technology     96778.695623
Sports                   27900.184374
Travel & Events          33063.363850
Name: likes, dtype: float64
df_objective3.groupby('category')['view_count'].mean()
category
Autos & Vehicles        9.365756e+05
Comedy                  1.841453e+06
Education               1.830250e+06
Entertainment           2.697683e+06
Film & Animation        3.202573e+06
Gaming                  1.625279e+06
Howto & Style           1.327054e+06
Music                   4.267132e+06
News & Politics         1.419264e+06
People & Blogs          1.792461e+06
Pets & Animals          1.265167e+06
Science & Technology    2.442207e+06
Sports                  1.146422e+06
Travel & Events         7.483659e+05
Name: view_count, dtype: float64
Now that we have all the necessary information, we create a new column called likes_mean, which holds 1 if the like count is greater than the mean of the likes, and 0 otherwise.
#create a new column where, if the number of likes is greater than the mean, the value is 1, otherwise 0
df_objective3['likes_mean'] = (df_objective3['likes'] > df_objective3['likes'].mean()).astype(int)
df_objective3.head()
| title | publishedAt | categoryId | trending_date | view_count | likes | dislikes | category | likes_mean | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | 24 | 2020-08-12T00:00:00Z | 2038853 | 353790 | 2628 | Entertainment | 1 |
| 1 | TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS' | 2020-08-11T20:00:45Z | 27 | 2020-08-12T00:00:00Z | 236830 | 16423 | 209 | Education | 0 |
| 2 | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | 20 | 2020-08-12T00:00:00Z | 2381688 | 146739 | 2794 | Gaming | 1 |
| 3 | Nines - Clout (Official Video) | 2020-08-10T18:30:28Z | 24 | 2020-08-12T00:00:00Z | 613785 | 37567 | 669 | Entertainment | 0 |
| 4 | i don't know what im doing anymore | 2020-08-11T20:24:34Z | 22 | 2020-08-12T00:00:00Z | 940036 | 87113 | 1860 | People & Blogs | 0 |
We plot graphs to find out the following:
(i) the relation between likes_mean and category (Figure 11)
(ii) the relation between the 50th percentile of likes and category (Figure 12)
(iii) the relation between the view count mean and category (Figure 13)
(iv) the relation between the 50th percentile of view counts and category (Figure 14)
#plot graph between likes_mean and category
df_objective3.groupby(['likes_mean','category']).size().unstack().plot(kind='bar',stacked=False,figsize=(15,10),xlabel='Likes Mean',ylabel='No. of Trending Videos',title="Figure 11 - Likes Mean vs. Category").legend(bbox_to_anchor=(1.0, 1.0),title="Category").axes.set_xticklabels(['Likes Lesser than Mean', 'Likes Greater than Mean'],rotation=0)
[Text(0, 0, 'Likes Lesser than Mean'), Text(1, 0, 'Likes Greater than Mean')]
Figure 11 shows that many videos trend even when their likes are below the mean. Let us take the 50% quantile to visualise the data more evenly.
df_objective3['likes_50'] = (df_objective3['likes'] > df_objective3['likes'].quantile(.5)).astype(int)
df_objective3.groupby(['likes_50','category']).size().unstack().plot(kind='bar',stacked=False,figsize=(15,10),title="Figure 12 - Likes 50% vs. Category",xlabel='Likes 50%',ylabel='No. of Trending Videos').legend(bbox_to_anchor=(1.0, 1.0),title="Category").axes.set_xticklabels(['Likes Lesser than 50%', 'Likes Greater than 50%'],rotation=0)
[Text(0, 0, 'Likes Lesser than 50%'), Text(1, 0, 'Likes Greater than 50%')]
In Figure 12, we see more trending videos above the 50th percentile in the Entertainment, Gaming, and Music categories, while Sports has more trending videos below the 50th percentile.
We follow the same procedure to investigate view count further.
df_objective3['view_count_mean'] = (df_objective3['view_count'] > df_objective3['view_count'].mean()).astype(int)
df_objective3['view_count_50'] = (df_objective3['view_count'] > df_objective3['view_count'].quantile(.5)).astype(int)
df_objective3.head()
| title | publishedAt | categoryId | trending_date | view_count | likes | dislikes | category | likes_mean | likes_50 | view_count_mean | view_count_50 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | 24 | 2020-08-12T00:00:00Z | 2038853 | 353790 | 2628 | Entertainment | 1 | 1 | 0 | 1 |
| 1 | TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS' | 2020-08-11T20:00:45Z | 27 | 2020-08-12T00:00:00Z | 236830 | 16423 | 209 | Education | 0 | 0 | 0 | 0 |
| 2 | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | 20 | 2020-08-12T00:00:00Z | 2381688 | 146739 | 2794 | Gaming | 1 | 1 | 1 | 1 |
| 3 | Nines - Clout (Official Video) | 2020-08-10T18:30:28Z | 24 | 2020-08-12T00:00:00Z | 613785 | 37567 | 669 | Entertainment | 0 | 1 | 0 | 0 |
| 4 | i don't know what im doing anymore | 2020-08-11T20:24:34Z | 22 | 2020-08-12T00:00:00Z | 940036 | 87113 | 1860 | People & Blogs | 0 | 1 | 0 | 1 |
We also compute a 25th-percentile indicator.
df_objective3['view_count_25'] = (df_objective3['view_count'] > df_objective3['view_count'].quantile(.25)).astype(int)
df_objective3.head()
| title | publishedAt | categoryId | trending_date | view_count | likes | dislikes | category | likes_mean | likes_50 | view_count_mean | view_count_50 | view_count_25 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | 24 | 2020-08-12T00:00:00Z | 2038853 | 353790 | 2628 | Entertainment | 1 | 1 | 0 | 1 | 1 |
| 1 | TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS' | 2020-08-11T20:00:45Z | 27 | 2020-08-12T00:00:00Z | 236830 | 16423 | 209 | Education | 0 | 0 | 0 | 0 | 0 |
| 2 | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | 20 | 2020-08-12T00:00:00Z | 2381688 | 146739 | 2794 | Gaming | 1 | 1 | 1 | 1 | 1 |
| 3 | Nines - Clout (Official Video) | 2020-08-10T18:30:28Z | 24 | 2020-08-12T00:00:00Z | 613785 | 37567 | 669 | Entertainment | 0 | 1 | 0 | 0 | 1 |
| 4 | i don't know what im doing anymore | 2020-08-11T20:24:34Z | 22 | 2020-08-12T00:00:00Z | 940036 | 87113 | 1860 | People & Blogs | 0 | 1 | 0 | 1 | 1 |
df_objective3.groupby(['view_count_mean','category']).size().unstack().plot(kind='bar',stacked=False,figsize=(15,10),title="Figure 13 - View Count Mean vs. Category",xlabel='View Count Mean',ylabel='No. of Trending Videos').legend(bbox_to_anchor=(1.0, 1.0),title="Category").axes.set_xticklabels(['View Count Lesser than Mean', 'View Count Greater than Mean'],rotation=0)
[Text(0, 0, 'View Count Lesser than Mean'), Text(1, 0, 'View Count Greater than Mean')]
We see the same trend in Figure 13 as in Figure 11: most of the trending videos fall below the mean view count.
df_objective3.groupby(['view_count_50','category']).size().unstack().plot(kind='bar',stacked=False,figsize=(15,10),title="Figure 14 - View Count 50% vs. Category",xlabel='View Count 50%',ylabel='No. of Trending Videos').legend(bbox_to_anchor=(1.0, 1.0),title="Category").axes.set_xticklabels(['View Count Lesser than 50%', 'View Count Greater than 50%'],rotation=0)
[Text(0, 0, 'View Count Lesser than 50%'), Text(1, 0, 'View Count Greater than 50%')]
Figure 14 shows the 50th-percentile split for view count. Some categories have the same number of videos on both sides, while Sports has more trending videos below the 50th percentile, and Gaming, Entertainment, and Music have more above it.
We see a massive number of likes and view counts for the top 5 categories compared to the rest.
df_objective3.groupby(['view_count_25','category']).size().unstack().plot(kind='bar',stacked=False,figsize=(15,10),title="Figure 15 - View Count 25% vs. Category",xlabel='View Count 25%',ylabel='No. of Trending Videos').legend(bbox_to_anchor=(1.0, 1.0),title="Category").axes.set_xticklabels(['View Count Lesser than 25%', 'View Count Greater than 25%'],rotation=0)
[Text(0, 0, 'View Count Lesser than 25%'), Text(1, 0, 'View Count Greater than 25%')]
Figure 15 shows the same comparison at the 25th percentile, and we see a similar trend there as well.
We further explore the ratio of view count to likes and the ratio of likes to view count, saving them in two new columns called view_to_like_ratio and like_to_view_ratio respectively.
#find the ratio of view_count to likes
df_objective3['view_to_like_ratio'] = df_objective3['view_count']/df_objective3['likes']
df_objective3['like_to_view_ratio'] = df_objective3['likes']/df_objective3['view_count']
df_objective3.head()
| title | publishedAt | categoryId | trending_date | view_count | likes | dislikes | category | likes_mean | likes_50 | view_count_mean | view_count_50 | view_count_25 | view_to_like_ratio | like_to_view_ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I left youtube for a month and THIS is what ha... | 2020-08-11T16:34:06Z | 24 | 2020-08-12T00:00:00Z | 2038853 | 353790 | 2628 | Entertainment | 1 | 1 | 0 | 1 | 1 | 5.762890 | 0.173524 |
| 1 | TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS' | 2020-08-11T20:00:45Z | 27 | 2020-08-12T00:00:00Z | 236830 | 16423 | 209 | Education | 0 | 0 | 0 | 0 | 0 | 14.420630 | 0.069345 |
| 2 | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11T17:00:10Z | 20 | 2020-08-12T00:00:00Z | 2381688 | 146739 | 2794 | Gaming | 1 | 1 | 1 | 1 | 1 | 16.230777 | 0.061611 |
| 3 | Nines - Clout (Official Video) | 2020-08-10T18:30:28Z | 24 | 2020-08-12T00:00:00Z | 613785 | 37567 | 669 | Entertainment | 0 | 1 | 0 | 0 | 1 | 16.338409 | 0.061205 |
| 4 | i don't know what im doing anymore | 2020-08-11T20:24:34Z | 22 | 2020-08-12T00:00:00Z | 940036 | 87113 | 1860 | People & Blogs | 0 | 1 | 0 | 1 | 1 | 10.790996 | 0.092670 |
With the help of the ratios, we see that a video receives a single like for multiple views, i.e. we do not get a like for every view. This matches reality: a music video may be played repeatedly, but a user can like it only once.
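To compare this engagement pattern across genres, one could aggregate like_to_view_ratio per category; a minimal sketch (excluding the non-finite ratios produced by the zero-count rows noted earlier):
# (Sketch) Average like-to-view ratio per category, highest first.
ratios = df_objective3[np.isfinite(df_objective3['like_to_view_ratio'])]
print(ratios.groupby('category')['like_to_view_ratio'].mean().sort_values(ascending=False))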
While we are on the subject of likes: we do not use dislikes in our analysis, because YouTube's API stopped reporting dislike counts from the end of 2021. We plot a bar plot to verify this in our data.
# Extract year from 'date' column
df_yearly_dislikes = df.copy()
df_yearly_dislikes['year'] = df_yearly_dislikes['date'].dt.year
df_yearly_dislikes['dislikes_nonzero'] = df['dislikes'].apply(lambda x: 0 if x == 0 else 1)
# Group by year and calculate sum of 'dislikes_zero'
yearly_dislikes_nonzero = df_yearly_dislikes.groupby('year')['dislikes_nonzero'].sum()
# Plot bar graph
yearly_dislikes_nonzero.plot(kind='bar', figsize=(15,10), title="Figure 16 - Yearly Dislikes", xlabel='Year', ylabel='Sum of Dislikes', color='royalblue')
<Axes: title={'center': 'Figure 16 - Yearly Dislikes'}, xlabel='Year', ylabel='Sum of Dislikes'>
Thus, from Figure 16, we can see there are dislikes in 2020 and 2021 alone; the empty bars for 2022 and 2023 indicate no dislike data. The dislike count for 2020 is lower than for 2021 because we do not have complete data for 2020 (only August to December).
We will now dig deeper into the dataset to see whether other variables, such as the time and month a video is published, affect its chances of trending. We use the info() method to check for null items and to see the datatypes and columns.
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 238067 entries, 0 to 238190
Data columns (total 20 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   video_id           238067 non-null  object
 1   title              238067 non-null  object
 2   publishedAt        238067 non-null  object
 3   channelId          238067 non-null  object
 4   channelTitle       238067 non-null  object
 5   categoryId         238067 non-null  int64
 6   trending_date      238067 non-null  object
 7   tags               238067 non-null  object
 8   view_count         238067 non-null  int64
 9   likes              238067 non-null  int64
 10  dislikes           238067 non-null  int64
 11  comment_count      238067 non-null  int64
 12  thumbnail_link     238067 non-null  object
 13  comments_disabled  238067 non-null  bool
 14  ratings_disabled   238067 non-null  bool
 15  description        233793 non-null  object
 16  category           237965 non-null  object
 17  date               238067 non-null  datetime64[ns, UTC]
 18  month              238067 non-null  category
 19  year               238067 non-null  int32
dtypes: bool(2), category(1), datetime64[ns, UTC](1), int32(1), int64(5), object(10)
memory usage: 32.5+ MB
The publishedAt column has no null items, but it is currently stored as an object, so we convert it to the datetime datatype. We then extract the published month and hour into two new columns called num_month and hour respectively.
df['publishedAt'] = pd.to_datetime(df['publishedAt'])
df['num_month'] = df['publishedAt'].dt.month
df['hour'] = df['publishedAt'].dt.hour
df.head()
| video_id | title | publishedAt | channelId | channelTitle | categoryId | trending_date | tags | view_count | likes | ... | thumbnail_link | comments_disabled | ratings_disabled | description | category | date | month | year | num_month | hour | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | J78aPJ3VyNs | I left youtube for a month and THIS is what ha... | 2020-08-11 16:34:06+00:00 | UCYzPXprvl5Y-Sf0g4vX-m6g | jacksepticeye | 24 | 2020-08-12T00:00:00Z | jacksepticeye|funny|funny meme|memes|jacksepti... | 2038853 | 353790 | ... | https://i.ytimg.com/vi/J78aPJ3VyNs/default.jpg | False | False | I left youtube for a month and this is what ha... | Entertainment | 2020-08-12 00:00:00+00:00 | Aug | 2020 | 8 | 16 |
| 1 | 9nidKH8cM38 | TAXI CAB SLAYER KILLS 'TO KNOW HOW IT FEELS' | 2020-08-11 20:00:45+00:00 | UCFMbX7frWZfuWdjAML0babA | Eleanor Neale | 27 | 2020-08-12T00:00:00Z | eleanor|neale|eleanor neale|eleanor neale true... | 236830 | 16423 | ... | https://i.ytimg.com/vi/9nidKH8cM38/default.jpg | False | False | The first 1000 people to click the link will g... | Education | 2020-08-12 00:00:00+00:00 | Aug | 2020 | 8 | 20 |
| 2 | M9Pmf9AB4Mo | Apex Legends | Stories from the Outlands – “Th... | 2020-08-11 17:00:10+00:00 | UC0ZV6M2THA81QT9hrVWJG3A | Apex Legends | 20 | 2020-08-12T00:00:00Z | Apex Legends|Apex Legends characters|new Apex ... | 2381688 | 146739 | ... | https://i.ytimg.com/vi/M9Pmf9AB4Mo/default.jpg | False | False | While running her own modding shop, Ramya Pare... | Gaming | 2020-08-12 00:00:00+00:00 | Aug | 2020 | 8 | 17 |
| 3 | kgUV1MaD_M8 | Nines - Clout (Official Video) | 2020-08-10 18:30:28+00:00 | UCvDkzrj8ZPlBqRd6fIxdhTw | Nines | 24 | 2020-08-12T00:00:00Z | Nines|Trapper of the year|Crop Circle|Nines Tr... | 613785 | 37567 | ... | https://i.ytimg.com/vi/kgUV1MaD_M8/default.jpg | False | False | Nines - Clout (Official Video)Listen to Clout ... | Entertainment | 2020-08-12 00:00:00+00:00 | Aug | 2020 | 8 | 18 |
| 4 | 49Z6Mv4_WCA | i don't know what im doing anymore | 2020-08-11 20:24:34+00:00 | UCtinbF-Q-fVthA0qrFQTgXQ | CaseyNeistat | 22 | 2020-08-12T00:00:00Z | [None] | 940036 | 87113 | ... | https://i.ytimg.com/vi/49Z6Mv4_WCA/default.jpg | False | False | ssend love to my sponsor; for a super Limited ... | People & Blogs | 2020-08-12 00:00:00+00:00 | Aug | 2020 | 8 | 20 |
5 rows × 22 columns
From the first five rows above, we can cross-check that the published month and hour have been placed correctly. We then create a new list called corr_list holding the variables whose correlations we want to examine.
Correlation means that there is a relationship between two things, but it does not always mean that one causes the other. It ranges from -1 to 1: 0 denotes no correlation, 1 denotes a positive correlation (if one variable increases, so does the other), and -1 denotes a negative correlation (if one variable increases, the other decreases). The closer a value is to -1 or 1, the stronger the correlation.
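As a small worked example of this definition (Pearson's r is the covariance of two variables divided by the product of their standard deviations), we can reproduce one entry of the matrix by hand; a sketch:
# (Sketch) Pearson's r computed manually for likes vs view_count:
# r = cov(x, y) / (std(x) * std(y))
x, y = df['likes'], df['view_count']
r_manual = ((x - x.mean()) * (y - y.mean())).mean() / (x.std(ddof=0) * y.std(ddof=0))
print(r_manual, x.corr(y))  # the two values should agree (about 0.83)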
We use the corr() method to get the correlation.
corr_list = ['hour','date','num_month','categoryId','year','likes','dislikes','comment_count','view_count']
df[corr_list].corr()
| hour | date | num_month | categoryId | year | likes | dislikes | comment_count | view_count | |
|---|---|---|---|---|---|---|---|---|---|
| hour | 1.000000 | 0.003967 | -0.000968 | 0.086224 | 0.004656 | -0.092060 | -0.059514 | -0.070440 | -0.104156 |
| date | 0.003967 | 1.000000 | -0.014301 | -0.038648 | 0.957254 | -0.044968 | -0.146859 | -0.036412 | -0.015463 |
| num_month | -0.000968 | -0.014301 | 1.000000 | -0.003229 | -0.285082 | 0.014485 | 0.021026 | 0.006841 | 0.007157 |
| categoryId | 0.086224 | -0.038648 | -0.003229 | 1.000000 | -0.034486 | -0.070575 | -0.016366 | -0.064965 | -0.052603 |
| year | 0.004656 | 0.957254 | -0.285082 | -0.034486 | 1.000000 | -0.047119 | -0.147075 | -0.037127 | -0.016800 |
| likes | -0.092060 | -0.044968 | 0.014485 | -0.070575 | -0.047119 | 1.000000 | 0.461104 | 0.699778 | 0.829249 |
| dislikes | -0.059514 | -0.146859 | 0.021026 | -0.016366 | -0.147075 | 0.461104 | 1.000000 | 0.429685 | 0.494723 |
| comment_count | -0.070440 | -0.036412 | 0.006841 | -0.064965 | -0.037127 | 0.699778 | 0.429685 | 1.000000 | 0.519968 |
| view_count | -0.104156 | -0.015463 | 0.007157 | -0.052603 | -0.016800 | 0.829249 | 0.494723 | 0.519968 | 1.000000 |
Now that we have the correlation matrix, we use seaborn's heatmap() function to visualise it.
plt.figure(figsize=(10,10))
fig = sns.heatmap(data = df[corr_list].corr(), annot=True).set_title("Figure 17 - Heatmap of Correlation Matrix")
From Figure 17, we see that likes and view count are highly positively correlated (0.83). Dislikes and view count, comment count and view count, and dislikes and comment count are also moderately correlated. The remaining variables are negligibly correlated, with values close to 0.
There is also a very high positive correlation between year and date, as expected, since one is derived from the other. We next explore how likes, view count and comment count vary with the publishing hour and the month, using seaborn's lineplot method.
f, ax = plt.subplots(2, 3, figsize=(25, 20))
variables = ['likes', 'view_count', 'comment_count']
x_values = ['hour','month']
plt.suptitle('Figure 19 - Lineplot of Likes, View Count and Comment Count vs Hour and Month',fontsize=20)
plt.subplots_adjust(top=0.85)
for i, x in enumerate(x_values):
    for j, var in enumerate(variables):
        sns.lineplot(x = df[x], y = df[var], data = df, ax = ax[i,j], marker = 'o').set_title(f'{var.title()} vs {x.title()}')
Figure 19 shows line plots of the number of likes, view count and comment count against hour and month. From the figure, we see that the majority of likes, views and comments accrue to videos published between 2 am and 7 am. We can see a similar pattern for videos published in the months between May and September.
Although we now have a time frame for the spike, we will use boxplot() to visualise the hourly and monthly patterns more precisely.
f, ax = plt.subplots(2, 3, figsize=(25, 20))
variables = ['likes', 'view_count', 'comment_count']
x_values = ['hour','month']
plt.suptitle('Figure 20 - Boxplot of Likes, View Count and Comment Count', fontsize=17,position=(0.5,0.95))
plt.subplots_adjust(top=0.85)
for i, x in enumerate(x_values):
    for j, var in enumerate(variables):
        sns.boxplot(x = df[x], y = np.log(df[var]), data = df, ax = ax[i,j]).set_title(f'{var.title()} vs {x.title()}')
/Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs)
From Figure 20, we can pin down the exact hour and month better: as per the graph, 5 am is the best time and June the best month to publish a video.
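We can cross-check this reading of the boxplots numerically by comparing median view counts per hour and per month; a minimal sketch:
# (Sketch) Numerically confirm the best hour and month suggested by
# Figure 20 (medians are robust to outliers).
print(df.groupby('hour')['view_count'].median().idxmax())
print(df.groupby('month', observed=False)['view_count'].median().idxmax())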
columns = ['likes', 'view_count', 'comment_count']
plt.figure(figsize=(16, 6))
plt.suptitle('Figure 21 - Boxplot of Likes, View Count and Comment Count', fontsize=12,position=(0.4,1.05))
plt.subplots_adjust(top=0.85)
for i, column in enumerate(columns):
    plt.subplot(2, 4, i + 1)
    sns.boxplot(x=np.log(df[column]))
    plt.title(column.title())
/Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs) /Applications/anaconda3/lib/python3.11/site-packages/pandas/core/arraylike.py:396: RuntimeWarning: divide by zero encountered in log result = getattr(ufunc, method)(*inputs, **kwargs)
We tried plotting with the actual counts of likes, view_count and comment_count, but the code cell could not render them usefully because of the extreme scales, so we plot the log of each value instead. (Zero counts produce the divide-by-zero warnings above.)
To begin our modelling, we import the scikit-learn Python library. Scikit-learn contains a variety of practical classes that help with regression modelling. We import train_test_split, LinearRegression and RandomForestRegressor for modelling, mean_squared_error and r2_score to help evaluate the models, and MinMaxScaler for feature scaling.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import MinMaxScaler
Since our dataset contains categorical data like categoryId, month, hour and year, we use one-hot encoding (via pandas' get_dummies) to create dummy variables.
Similarly, our dataset contains numerical values in likes, dislikes and comment_count that sit on very different scales. We therefore use MinMaxScaler from scikit-learn to scale the numerical columns. Scikit-learn also provides StandardScaler, but we use MinMaxScaler since it scales the data to the range [0, 1] and preserves the shape of the distribution.
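As a small worked example of what MinMaxScaler does (x' = (x − min) / (max − min) per column), consider this sketch:
# (Sketch) MinMaxScaler maps each column to the range [0, 1]:
# x_scaled = (x - x.min()) / (x.max() - x.min())
demo = np.array([[10.0], [20.0], [40.0]])
print(MinMaxScaler().fit_transform(demo))  # [[0.], [0.333...], [1.]]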
Finally, we drop all unnecessary columns from the data frame to reduce computational complexity, and then create an 80:20 train/test split.
df_oneHotEncode = df.copy()
#list of columns to be one hot encoded
ohe_columns = ['categoryId','num_month','hour','year']
#one hot encode the columns
df_oneHotEncode = pd.get_dummies(df_oneHotEncode, columns=ohe_columns)
# drop the remaining non-numeric columns that are not needed for modelling
df_oneHotEncode = df_oneHotEncode.drop(['title','publishedAt','trending_date','date','month','description','video_id','channelId','tags','channelTitle','thumbnail_link','comments_disabled','ratings_disabled','category'], axis=1)
# scale the numerical columns
scaler = MinMaxScaler()
df_oneHotEncode[['likes', 'dislikes', 'comment_count']] = scaler.fit_transform(df_oneHotEncode[['likes', 'dislikes', 'comment_count']])
df_oneHotEncode.head()
# view_count is the prediction target; all remaining columns are features
features = df_oneHotEncode.drop(['view_count'], axis=1)
target = df_oneHotEncode['view_count']
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)
Next, we instantiate a linear regression model from scikit-learn and assign it to a variable called model. The object's fit method trains the linear regression model on our training data X_train and y_train. Finally, we use the trained model to predict the view counts for our test set X_test.
#fit the model
model = LinearRegression()
model.fit(X_train, y_train)
#predict the model
y_pred = model.predict(X_test)
We can then print the Mean Squared Error, R² and Root Mean Squared Error.
#evaluate the model
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
print('Variance score: %.2f' % r2_score(y_test, y_pred))
print('Root Mean squared error: %.2f' % mean_squared_error(y_test, y_pred, squared=False))
Mean squared error: 10033078753476.97 Variance score: 0.73 Root Mean squared error: 3167503.55
From the output above, we get a variance (R²) score of 0.73. While this score is reasonably close to 1, we can try another model to see whether we can do better.
When we plot the graph for Linear Regression, we can observe that while Linear Regression can predict view counts for trending videos up to 1e8 views, it fails to accurately predict higher view counts, thus lowering our variance score.
#plot the model
plt.scatter(y_test, y_pred)
plt.xlabel("Actual view count")
plt.ylabel("Predicted view count")
plt.title("Figure 22 - Actual vs Predicted view count")
plt.show()
We can now use Linear Regression as our baseline model to evaluate our next model. We are using RandomForestRegressor from Scikit-Learn to implement our model.
RandomForestRegressor takes an argument called n_estimators that determines the number of decision trees to use. To find an appropriate value for n_estimators, we define a function that takes n as the number of estimators and returns the MSE, R2 and RMSE values. (This exploration is left commented out below to keep the notebook's run time short.)
# def RandomForestList(n):
#     model = RandomForestRegressor(n_estimators=n, random_state=201750985)
#     model.fit(X_train, y_train.values.ravel())
#     y_pred = model.predict(X_test)
#     return mean_squared_error(y_test, y_pred), r2_score(y_test, y_pred), mean_squared_error(y_test, y_pred, squared=False)
We can then pass the function through a for loop that iterates over the list of estimators as defined below. We can then append the resultant scores against the n_estimators to a new data frame.
# rf_df = pd.DataFrame(columns=['n_estimators', 'MSE', 'RMSE', 'R2'])
# n_estimates = [1, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
# for i in n_estimates:
#     mse, r2, rmse = RandomForestList(i)
#     rf_df = pd.concat([rf_df, pd.DataFrame([{'n_estimators': i, 'MSE': mse, 'RMSE': rmse, 'R2': r2}])], ignore_index=True)
# rf_df.head(100)
We can now plot RMSE vs n_estimators and R2 vs n_estimators. From the graph, we see that increasing n_estimators initially causes a rapid rise in the R2 value, with only marginal gains beyond n_estimators = 15 and essentially no change after that. Thus, we choose n_estimators = 50 to maximise R2 while keeping compute time reasonable.
# fig, axes = plt.subplots(1, 2, figsize=(15, 5))
# axes[0].plot(rf_df['n_estimators'], rf_df['RMSE'])
# axes[0].set_xlabel('n_estimators')
# axes[0].set_ylabel('RMSE')
# axes[0].set_title('Figure 23 - RMSE vs n_estimators')
# axes[0].axvline(x=50, color='r', linestyle='--')
# axes[1].plot(rf_df['n_estimators'], rf_df['R2'], color='orange')
# axes[1].set_xlabel('n_estimators')
# axes[1].set_ylabel('R2')
# axes[1].set_title('Figure 24 - R2 vs n_estimators')
# axes[1].axvline(x=50, color='r', linestyle='--')
# plt.show()
After determining 50 as the ideal value of n_estimators, we can train our Random Forest model. After training with X_train and y_train, we can test the model on the X_test dataset.
# # create regressor object
# regressor = RandomForestRegressor(n_estimators = 50, random_state = 201750985)
# regressor.fit(X_train, y_train.values.ravel())
# # predict the result
# y_pred = regressor.predict(X_test)
Printing the scores for our model shows a significant increase in the R2 value, from 0.73 to 0.95. Thus, we choose the RandomForestRegressor with n_estimators = 50 as our prediction model.
# print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))
# print('Variance score: %.2f' % r2_score(y_test, y_pred))
# print('Root Mean squared error: %.2f' % mean_squared_error(y_test, y_pred, squared=False))
Plotting actual vs predicted values shows that our model has improved over plain Linear Regression in determining the view_count of videos given a set of attributes about them.
# #plot the model
# plt.scatter(y_test, y_pred)
# plt.xlabel("Actual view count")
# plt.ylabel("Predicted view count")
# plt.title("Actual vs Predicted view count")
# plt.show()
This section should describe the outcome of the project by means of both explanation of the results and graphical visualisation in the form of graphs, charts or other kinds of diagram.
The section should begin with a general overview of the results and then have a section for each of the project objectives. For each objective, an explanation of the more specific results relating to it should be given, followed by a section presenting some visualisation of the results obtained. (In the case where the project had just one objective, you should still have a section describing the results from a general perspective, followed by a section that focuses on the particular objective.)
The marks for this section will be divided into 10 marks for Explanation and 10 marks for Visualisation. These marks will be awarded for the Project Outcome section as a whole, not for each objective individually. Hence, you do not have to pay equal attention to each. However, you are expected to have some explanation and visualisation for each. It is suggested you have 200-400 words of explanation for each objective.
Give a general overview of the results (around 200 words).
We can deduce two conclusions:
Check the keywords to be used in the title and/or description for a video to reach the trending category
We now investigate the keywords used in the title and the video tags, and how frequent those keywords are among the trending videos in each category. We first created a new variable 'title_length' to store the number of words in each title. As we can see in Figure 4, most of the trending videos have about 8 words in the title, and there are almost no trending videos with over 20 words.
We then looked for the actual keywords that are significant in the title and the video tags. We used word clouds to plot the keywords for each category: Figure 5 for titles and Figure 6 for tags. In Figure 5, the larger, bolder words are the more frequently used keywords of highest significance. Thus, in the Entertainment category, keywords like 'Official' and 'Trailer' are used more, while 'Slow Mo' is of lesser significance. Similarly, in Figure 6, 'Manchester United' has the highest significance in the Entertainment category.
The same goes for the rest of the categories, where ‘Official Video’ tops the list across a few categories, followed by the local keywords such as ‘Prime Minister’ for News & Politics, ‘League Highlights’ for Sports, ‘iPhone Pro’ for Science & Technology, etc. in regards to the keywords used in the title.
Now that we have observed the keywords used, we generated a word cloud of the keywords common to both the title and the video tags. From Figure 7, we spot that "Season" is the most commonly used keyword in both the titles and the video tags in the Entertainment category.
Based on the observations made, we can deduce the following key points:
Your concluding section should be around 200-400 words. It is recommended that you divide it into the following sections.
The limitations include the following:
Modelling with a Random Forest on large-scale datasets like the YouTube API data is computationally expensive: training t trees on n samples costs roughly O(t · m · n log n), where m is the number of features considered per split, so run times grow quickly with dataset size (see the sketch below).
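A minimal sketch of how that growth could be measured empirically (assuming the X_train/y_train split defined above, and subsampling to keep the run time short):
# (Sketch) Empirically measure how training time grows with n_estimators.
import time
for n in [10, 50, 100]:
    t0 = time.perf_counter()
    RandomForestRegressor(n_estimators=n, n_jobs=-1).fit(X_train.iloc[:5000], y_train.iloc[:5000])
    print(n, round(time.perf_counter() - t0, 2), "seconds")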
For future work, we could enhance the project by probing further into the following: